diBELLA: Distributed Long Read to Long Read Alignment
We present a parallel algorithm and scalable implementation for genome
analysis, specifically the problem of finding overlaps and alignments for data
from "third generation" long read sequencers. While long sequences of DNA offer
enormous advantages for biological analysis and insight, current long read
sequencing instruments have high error rates and therefore require different
approaches to analysis than their short read counterparts. Our work focuses on
an efficient distributed-memory parallelization of an accurate single-node
algorithm for overlapping and aligning long reads. We achieve scalability of
this irregular algorithm by addressing the competing issues of increasing
parallelism, minimizing communication, constraining the memory footprint, and
ensuring good load balance. The resulting application, diBELLA, is the first
distributed memory overlapper and aligner specifically designed for long reads
and parallel scalability. We describe and present analyses for high level
design trade-offs and conduct an extensive empirical analysis that compares
performance characteristics across state-of-the-art HPC systems as well as a
commercial cloud architecture, highlighting the advantages of state-of-the-art
network technologies.
Comment: This is the authors' preprint of the article that appears in the
proceedings of ICPP 2019, the 48th International Conference on Parallel
Processing.
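Long-read overlap pipelines of the kind diBELLA parallelizes are typically seeded by shared k-mers: reads that contain a common k-length substring become candidate overlap pairs, which are then verified by alignment. The following is a minimal single-node sketch of that seeding step; the function names and the simple set-based index are illustrative assumptions, not diBELLA's actual code.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield all length-k substrings of a read."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def candidate_overlaps(reads, k):
    """Map each k-mer to the set of reads containing it, then pair up
    reads that share at least one k-mer (candidate overlaps to verify
    later by alignment)."""
    index = defaultdict(set)
    for rid, seq in reads.items():
        for km in kmers(seq, k):
            index[km].add(rid)
    pairs = set()
    for rids in index.values():
        rids = sorted(rids)
        for i in range(len(rids)):
            for j in range(i + 1, len(rids)):
                pairs.add((rids[i], rids[j]))
    return pairs

reads = {"r1": "ACGTACGT", "r2": "TACGTTTT", "r3": "GGGGCCCC"}
print(candidate_overlaps(reads, 5))  # {('r1', 'r2')}: r1 and r2 share "TACGT"
```

In a distributed setting, the k-mer index itself is what must be partitioned across nodes, which is where the communication and load-balance issues the abstract describes arise.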
10 Years Later: Cloud Computing is Closing the Performance Gap
Can cloud computing infrastructures provide HPC-competitive performance for
scientific applications broadly? Despite prolific related literature, this
question remains open. Answers are crucial for designing future systems and
democratizing high-performance computing. We present a multi-level approach to
investigate the performance gap between HPC and cloud computing, isolating
different variables that contribute to this gap. Our experiments are divided
into (i) hardware and system microbenchmarks and (ii) user application proxies.
The results show that today's high-end cloud computing can deliver
HPC-competitive performance not only for computationally intensive applications
but also for memory- and communication-intensive applications - at least at
modest scales - thanks to the high-speed memory systems and interconnects and
dedicated batch scheduling now available on some cloud platforms.
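Hardware microbenchmarks of the kind described in (i) often include a streaming memory-bandwidth kernel. Below is a simplified STREAM-triad-style sketch in NumPy; the function name and parameters are illustrative, not the paper's actual benchmark suite.

```python
import time
import numpy as np

def triad_bandwidth(n=20_000_000, trials=5):
    """STREAM-triad-like kernel a = b + s*c. Reports the best observed
    memory bandwidth in GB/s: each iteration streams three arrays of
    8-byte floats (read b, read c, write a), i.e. 24*n bytes."""
    b = np.random.rand(n)
    c = np.random.rand(n)
    s = 3.0
    best = 0.0
    for _ in range(trials):
        t0 = time.perf_counter()
        a = b + s * c
        dt = time.perf_counter() - t0
        best = max(best, 3 * 8 * n / dt / 1e9)
    return best

print(f"best triad bandwidth: {triad_bandwidth():.1f} GB/s")
```

Running the same kernel on an HPC node and a cloud instance gives one of the isolated data points this kind of multi-level comparison is built from.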
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing.
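The two motifs the abstract argues are missing can be illustrated on the same task, counting k-mers, solved once with a hash table and once by sorting. This is a toy sketch, not code from the paper:

```python
from collections import Counter
from itertools import groupby

def count_kmers_hash(seq, k):
    """Hashing motif: a single pass of hash-table updates. In parallel,
    these become asynchronous updates to a shared data structure."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def count_kmers_sort(seq, k):
    """Sorting motif: sort the k-mers, then count equal runs. In
    distributed memory this becomes a sample sort plus a local scan,
    trading random access for structured communication."""
    kms = sorted(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {km: len(list(g)) for km, g in groupby(kms)}

seq = "ACGTACGTAC"
print(count_kmers_hash(seq, 4))  # ACGT: 2, CGTA: 2, GTAC: 2, TACG: 1
```

Both strategies compute the same answer; which one scales better depends on the machine's balance of network latency, bandwidth, and memory, which is exactly why these patterns merit motif status.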
LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment
Pairwise sequence alignment is one of the most computationally intensive
kernels in genomic data analysis, accounting for more than 90% of the runtime
for key bioinformatics applications. This method is particularly expensive for
third-generation sequences due to the high computational cost of analyzing
sequences of length between 1Kb and 1Mb. Given the quadratic overhead of exact
pairwise algorithms for long alignments, the community primarily relies on
approximate algorithms that search only for high-quality alignments and stop
early when one is not found. In this work, we present the first GPU
optimization of the popular X-drop alignment algorithm, which we named LOGAN.
Results show that our high-performance multi-GPU implementation achieves up to
181.6 GCUPS and speed-ups up to 6.6x and 30.7x using 1 and 6 NVIDIA Tesla V100 GPUs,
respectively, over the state-of-the-art software running on two IBM Power9
processors using 168 CPU threads, with equivalent accuracy. We also demonstrate
a 2.3x LOGAN speed-up versus ksw2, a state-of-the-art vectorized algorithm for
sequence alignment implemented in minimap2, a long-read mapping software. To
highlight the impact of our work on a real-world application, we couple LOGAN
with a many-to-many long-read alignment software called BELLA, and demonstrate
that our implementation improves the overall BELLA runtime by up to 10.6x.
Finally, we adapt the Roofline model for LOGAN and demonstrate that our
implementation is near-optimal on NVIDIA Tesla V100 GPUs.
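The X-drop idea itself is simple to sketch: keep extending an alignment, but abandon it as soon as the running score falls more than X below the best score seen so far. Below is a toy ungapped version in Python; LOGAN implements the gapped, antidiagonal-banded variant on GPUs, so this sketch is only illustrative of the termination criterion.

```python
def xdrop_extend(a, b, x, match=1, mismatch=-1):
    """Ungapped X-drop extension: walk both sequences in lockstep,
    scoring matches/mismatches, and stop once the running score falls
    more than x below the best score seen so far. Returns
    (best_score, extension_length). The gapped variant applies the same
    cutoff to each antidiagonal of a banded DP matrix."""
    best = score = 0
    best_len = 0
    for i in range(min(len(a), len(b))):
        score += match if a[i] == b[i] else mismatch
        if score > best:
            best, best_len = score, i + 1
        if best - score > x:  # the X-drop termination test
            break
    return best, best_len

print(xdrop_extend("ACGTACGT", "ACGTTTTT", x=2))  # (4, 4)
```

The early-exit test is what makes the method approximate but fast: low-quality extensions are cut off long before the quadratic cost of a full alignment is paid.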
New Generation of Educators Initiative: Transforming teacher preparation.
The focus of the New Generation of Educators Initiative (NGEI) was to answer the question "What would it take to transform teacher education?" From 2016 to 2019, with support from the S. D. Bechtel, Jr. Foundation, teacher education programs at 10 California State University (CSU) campuses partnered with local school districts to design and demonstrate innovative practices that could transform teacher preparation. This report documents the learnings from multiple participants in this transformative work, including Foundation program staff and representatives from partnerships between universities and school districts.
Parallelizing Irregular Applications for Distributed Memory Scalability: Case Studies from Genomics
Generalizable approaches, models, and frameworks for irregular application scalability are an old yet open area in parallel and distributed computing research. Irregular applications are particularly hard to parallelize and distribute because, by definition, the pattern of computation is dependent upon the input data. With the proliferation of data-driven and data-intensive applications from the realm of Big Data, and the increasing demand for and availability of large-scale computing resources through HPC-Cloud convergence, the importance of generalized approaches to achieving irregular application scalability is only growing. Rather than offering another software language or framework, this dissertation argues we first need to understand application scalability, especially irregular application scalability, and more closely examine patterns of computation, data sharing, and dependencies. As it stands, predominant performance models and tools from parallel and distributed computing focus on applications that are divided into distinct communication and computation phases, and ignore issues related to memory utilization. While time-tested and valuable, these models are not always sufficient for understanding full application scalability, particularly the scalability of data-intensive irregular applications. We present application case studies from genomics, highlighting the interdependencies of communication, computation, and memory capacities and performance. The genomics applications we examine offer a particularly useful and practical vantage point for this analysis, as they are data-intensive irregular application targets for both HPC and cloud computing. Further, they present an extreme for both domains. For HPC, they are less akin to traditional, well-studied and well-supported scientific simulations and more akin to text and document analysis applications.
For cloud computing, they are an extreme in that they require frequent random global access to memory and data, stressing interconnection network latency and bandwidth and co-scheduled processors for tightly orchestrated computation. We show how common patterns of irregular all-to-all computation can be managed efficiently, comparing bulk-synchronous approaches built on collective communication and asynchronous approaches based on one-sided communication. For the former, our work is based on the popular Message Passing Interface (MPI) and makes heavy use of globally collective communication operations that exchange data across processors in a single step or, to save memory use, in a set of irregular steps. For the latter, we build on the UPC++ programming framework, which provides lightweight RPC mechanisms, to transfer both data and computational work between processors. We present performance results across multiple platforms including several modern HPC systems and, at least in one case, a cloud computing platform. With these application case studies, we seek not only to contribute to discussions around parallel algorithm and data structure design, programming systems, and performance modeling within the parallel computing community, but also to contribute to broader work in genomics through software development and analysis. Thus, we develop and present the first distributed memory scalable software for analyzing data sets from the latest generation of sequencing technologies, known as long read data sets. Specifically, we present scalable solutions to the problem of many-to-many long read overlap and alignment, the computational bottleneck to long read assembly, error correction, and direct analysis. Through cross-architectural empirical analysis, we identify the key components to efficient scalability, and highlight the priorities for any future optimization with analytical models.
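The bulk-synchronous pattern described above, where each process buckets its outgoing items by destination rank and then exchanges everything in one collective step, can be sketched without MPI as a toy simulation. Real code would use MPI_Alltoallv and hash k-mers to ranks; here the keys are already small integers and the "collective" is a plain function.

```python
def bucket_by_rank(items, nranks):
    """Phase 1 of a bulk-synchronous exchange: each rank buckets its
    outgoing (key, value) pairs by destination rank. Real code would
    hash a k-mer to pick the rank; here keys are integers already."""
    buckets = [[] for _ in range(nranks)]
    for key, val in items:
        buckets[key % nranks].append((key, val))
    return buckets

def alltoallv(sendbufs):
    """Simulated MPI_Alltoallv: sendbufs[i][j] is what rank i sends to
    rank j; rank j receives the concatenation over all senders i."""
    n = len(sendbufs)
    return [[it for i in range(n) for it in sendbufs[i][j]] for j in range(n)]

# two ranks, each holding local items keyed by k-mer id
rank_items = [[(0, "a"), (1, "b"), (2, "c")], [(1, "d"), (3, "e")]]
send = [bucket_by_rank(items, 2) for items in rank_items]
recv = alltoallv(send)
print(recv)  # rank 0 receives the even keys, rank 1 the odd keys
```

The asynchronous alternative the dissertation compares against replaces the single collective step with one-sided RPCs that deliver each item to its owner as soon as it is produced, trading synchronization for per-message overhead.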
Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly
One of the most computationally intensive tasks in computational biology is
de novo genome assembly, the decoding of the sequence of an unknown genome from
redundant and erroneous short sequences. A common assembly paradigm identifies
overlapping sequences, simplifies their layout, and creates consensus. Despite
many algorithms developed in the literature, the efficient assembly of large
genomes is still an open problem. In this work, we introduce new
distributed-memory parallel algorithms for overlap detection and layout
simplification steps of de novo genome assembly, and implement them in the
diBELLA 2D pipeline. Our distributed memory algorithms for both overlap
detection and layout simplification are based on linear-algebra operations over
semirings using 2D distributed sparse matrices. Our layout step consists of
performing a transitive reduction from the overlap graph to a string graph. We
provide a detailed communication analysis of the main stages of our new
algorithms. diBELLA 2D achieves near linear scaling with over 80% parallel
efficiency for the human genome, reducing the runtime for overlap detection by
1.2-1.3x for the human genome and 1.5-1.9x for C. elegans compared to the
state-of-the-art. Our transitive reduction algorithm outperforms an existing
distributed-memory implementation by 10.5-13.3x for the human genome and 18-29x
for C. elegans. Our work paves the way for efficient de novo assembly of
large genomes using long reads in distributed memory.
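The transitive-reduction step can be sketched over the boolean semiring: an overlap edge u->w is transitive when some intermediate v gives u->v and v->w, and a boolean matrix product of the adjacency matrix with itself detects exactly those two-hop paths. Below is a dense NumPy toy version; diBELLA 2D's actual algorithm operates on 2D-distributed sparse matrices and accounts for overlap lengths, which this sketch ignores.

```python
import numpy as np

def transitive_reduction(adj):
    """One-hop transitive reduction over the boolean semiring: an edge
    u->w is removed when (A @ A) is nonzero at (u, w), i.e. when some v
    gives u->v->w. Toy dense version of the sparse-matrix formulation."""
    a = adj.astype(bool)
    two_hop = (a.astype(int) @ a.astype(int)) > 0  # boolean-semiring matmul
    return a & ~two_hop

# chain 0->1->2 plus the shortcut 0->2; the shortcut is transitive
a = np.zeros((3, 3), dtype=bool)
a[0, 1] = a[1, 2] = a[0, 2] = True
print(transitive_reduction(a).astype(int))
# [[0 1 0]
#  [0 0 1]
#  [0 0 0]]
```

Casting the reduction as a matrix product is what lets the pipeline reuse well-studied 2D distributed sparse-matrix multiplication, with communication costs that can be analyzed in closed form.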